Annotation Tool Development for Large-Scale Corpus Creation Projects at the Linguistic Data Consortium
نویسندگان
چکیده
The Linguistic Data Consortium (LDC) creates a variety of linguistic resources – data, annotations, tools, standards and best practices – for many sponsored projects. The programming staff at LDC has created the tools and technical infrastructures to support the data creation efforts for these projects, creating tools and technical infrastructures for all aspects of data creation projects: data scouting, data collection, data selection, annotation, search, data tracking and workflow management. This paper introduces a number of samples of LDC programming staff’s work, with particular focus on the recent additions and updates to the suite of software tools developed by LDC. Tools introduced include the GScout Web Data Scouting Tool, LDC Data Selection Toolkit, ACK Annotation Collection Kit, XTrans Transcription and Speech Annotation Tool, GALE Distillation Toolkit, and the GALE MT Post Editing Workflow Management
منابع مشابه
Annotation Tools for Large-Scale Corpus Development: Using AGTK at the Linguistic Data Consortium
Large-scale corpus development demands substantial infrastructure. As part of this infrastructure, the Linguistic Data Consortium (LDC) has adopted the Annotation Graph Toolkit (AGTK) as a primary resource for annotation tool development. This paper reports on LDC’s experiences using AGTK to develop and implement highly customized annotation tools for a variety of large-scale corpus creation ef...
متن کاملLanguage Resource Creation and Distribution at the Linguistic Data Consortium: A Progress Report
Changes in the supply of and demand for language resources continues to affect the role of large data centers such as the Linguistic Data Consortium (LDC) and European Language Resource Center (ELRA) within the research communities they serve. The past few years have seen increased demand for: intensively multi-modal resources, larger data sets in high-density languages and new data in low dens...
متن کاملQuality Control in Large Annotation Projects Involving Multiple Judges: The Case of the TDT Corpora
The Linguistic Data Consortium at the University of Pennsylvania has recently been engaged in the creation of large-scale annotated corpora of broadcast news materials in support of the ongoing Topic Detection and Tracking (TDT) research project. The TDT corpora were designed to support three basic research tasks: segmentation, topic detection, and topic tracking in newswire, television and rad...
متن کاملA New Phase in Annotation Tool Development at the Linguistic Data Consortium: The Evolution of the Annotation Graph Toolkit
The Linguistic Data Consortium (LDC) has created various annotated linguistic data for a variety of common task evaluation programs and projects to create shared linguistic resources. The majority of these annotated linguistic data were created with highly customized annotation tools developed at LDC. The Annotation Graph Toolkit (AGTK) has been used as a primary infrastructure for annotation t...
متن کاملShared Resources for Multilingual Information Extraction and Challenges in Named Entity Annotation
Progress in natural language processing requires increasing amounts of data and annotation in a growing variety of languages, and research in named entity extraction is no exception. While the value of richlyannotated, large-scale multilingual corpora is undeniable, costs for producing such data are high, underscoring the value of shared resources. As part of the US Governmentsponsored Automati...
متن کامل